Reliable Measures for Aligning Japanese-English News Articles and Sentences

Authors

  • Masao Utiyama
  • Hitoshi Isahara
Abstract

We have aligned Japanese and English news articles and sentences to make a large parallel corpus. We first used a method based on cross-language information retrieval (CLIR) to align the Japanese and English articles and then used a method based on dynamic programming (DP) matching to align the Japanese and English sentences in these articles. However, the results included many incorrect alignments. To remove these, we propose two measures (scores) that evaluate the validity of alignments. The measure for article alignment uses the similarities of sentences aligned by DP matching, and the measure for sentence alignment uses the similarities of articles aligned by CLIR. The two measures enhance each other to improve the accuracy of alignment. Using these measures, we have successfully constructed a large-scale article and sentence alignment corpus that is available to the public.
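
The abstract gives no formulas, so the following is only a minimal sketch of how DP sentence matching and two mutually dependent scores could fit together. Everything in it is an assumption made for illustration: the function names (`token_overlap`, `dp_align`, `article_score`, `sentence_scores`), the toy token-overlap similarity, the choice of an averaged sentence similarity as the article-level score, and the product used for the sentence-level score. It also assumes the Japanese sentences have already been mapped to English tokens (for example by dictionary lookup), which the abstract does not describe.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = Sequence[str]
Pair = Tuple[int, int, float]  # (source index, target index, similarity)


def token_overlap(a: Sentence, b: Sentence) -> float:
    """Toy similarity (Jaccard overlap of token sets); a stand-in for a
    real cross-lingual sentence similarity."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def dp_align(
    src: List[Sentence],
    tgt: List[Sentence],
    sim: Callable[[Sentence, Sentence], float] = token_overlap,
) -> List[Pair]:
    """Monotonic 1-to-1 alignment with skips, maximizing total similarity."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:  # leave src[i-1] unaligned
                candidates.append((best[i - 1][j], (i - 1, j, False)))
            if j > 0:  # leave tgt[j-1] unaligned
                candidates.append((best[i][j - 1], (i, j - 1, False)))
            if i > 0 and j > 0:  # align src[i-1] with tgt[j-1]
                score = best[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1])
                candidates.append((score, (i - 1, j - 1, True)))
            # Ties resolve to the earliest candidate, so skips win over
            # zero-similarity matches.
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the end to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((pi, pj, sim(src[pi], tgt[pj])))
        i, j = pi, pj
    pairs.reverse()
    return pairs


def article_score(pairs: List[Pair]) -> float:
    """Assumed article-level measure: mean similarity of aligned sentences."""
    if not pairs:
        return 0.0
    return sum(s for _, _, s in pairs) / len(pairs)


def sentence_scores(pairs: List[Pair], article_level: float) -> List[Pair]:
    """Assumed sentence-level measure: sentence similarity weighted by an
    article-level similarity, so pairs from poorly matched articles rank low."""
    return [(i, j, article_level * s) for i, j, s in pairs]


if __name__ == "__main__":
    src = [["stocks", "rose", "tokyo"], ["weather", "sunny"], ["election", "results"]]
    tgt = [["tokyo", "stocks", "rose", "sharply"], ["election", "results", "announced"]]
    pairs = dp_align(src, tgt)
    print(pairs)
    print(sentence_scores(pairs, article_score(pairs)))
```

In this sketch, thresholding `article_score` would filter out doubtful article pairs, and thresholding the values from `sentence_scores` would filter out doubtful sentence pairs; the thresholds themselves would need to be tuned on held-out data.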

Similar resources

Machine Translation of Sentences with Fixed Expressions

This paper presents a practical machine translation system based on sentence types for economic news stories. Conventional English-to-Japanese machine translation (MT) systems, which are rule-based, have difficulty translating certain types of Associated Press (AP) wire service news stories, such as economics and sports, because these topics include many fixed expressions (such as comp...

Detection of Difference between News Articles on the Same Topic Based on Sequential Comparison

Currently, a lot of news articles are published on the Web, and it is getting easier for us to read them. However, the number of articles is too large for us to read all of them. Although some Web sites cluster or classify news articles into topics (categories), this is not enough, since each topic still contains a large number of articles. Detecting differences between articles on one topic will b...

Translation of News Headlines

Machine translation of news headlines is difficult because the sentences are fragmentary and abbreviations and acronyms of proper names are frequently used. Another difficulty is that, since the headline comes at the top of a news article, the context information useful for disambiguating word senses and determining their translation (target word) is not available. This paper proposes a new a...

Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...

Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary

One of the crucial parts of any corpus-based machine translation system is a large-scale bilingual corpus that is aligned at various levels, such as the sentence and phrase levels. This kind of corpus, however, is not easy to obtain, and accordingly, there is a great need for an efficient construction method. We approach this problem by integrating two large monolingual corpora in two different...


Publication date: 2003